This R Markdown script begins by loading the libraries needed for the project; the setup chunk follows the abstract.

Abstract

This research project aimed to identify the factors associated with depression and poor mental health rates in the United States. The study analyzed data from multiple sources, including the Behavioral Risk Factor Surveillance System, American Community Survey, and COVID-19 Community Profile Report. The data were used to build multiple linear regression models and regression trees, with stepwise selection used to identify the optimal combination of features. The study found that demographic, socioeconomic, and health-related variables significantly predict depression and poor mental health rates. The models can serve as a basis for further research on this topic and assist policymakers in identifying high-risk populations and designing targeted interventions. However, the study has limitations, such as possible measurement bias, limited geographical coverage, and potential confounding variables that were not accounted for. Further research could explore additional factors related to depression and use qualitative methods to better understand the subjective experiences of individuals living with depression. Overall, this study provides insights into the factors associated with depression and poor mental health rates in the United States and can inform future research and policy decisions.

rm(list=ls())
library(readr)
library(ggplot2)
library(ggpubr)
library(tidyr)
library(corrplot)
library(ezids)
library(car)
library(rpart)
library(rpart.plot)
library(rattle)
library(tree)
# Load the merged data set from the project repository
url <- 'https://raw.githubusercontent.com/eitanaka/DATS6101_Final_Project_Team2/main/data_set/geo_socio_health_df.csv'
master_df <- read_csv(url)
# Drop the first column
master_df <- master_df[, -1]
# Rename selected columns to shorter names that are easier to work with:
# colnames() locates each original name, and the new name is assigned in its place.
colnames(master_df)[colnames(master_df) == "MT_Never Married"] <- "mt.nev.mar"
colnames(master_df)[colnames(master_df) == "MT_Now married"] <- "mt.now.mar"
colnames(master_df)[colnames(master_df) == "Total Population"] <- "tot.pop"
colnames(master_df)[colnames(master_df) == "EA_Less than high school graduate"] <- "ea.less.hs.deg"
colnames(master_df)[colnames(master_df) == "EA_High school graduate"] <- "ea.hs.deg"
colnames(master_df)[colnames(master_df) == "EA_college or associate's degree"] <- "ea.col.ass.deg"
colnames(master_df)[colnames(master_df) == "EA_Bachelor's degree"] <- "ea.ba.deg"
colnames(master_df)[colnames(master_df) == "EA_Graduate or professional degree"] <- "ea.grad.prof.deg"
# Create a new data frame containing only the numeric variables
numeric_vars <- sapply(master_df, is.numeric)
num_df <- master_df[, numeric_vars]
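
The one-line-per-column renaming above can also be written with a lookup vector, which keeps the old and new names side by side. The following is a minimal sketch on a toy data frame, not the project's code: `toy_df`, `rename_map`, and `hits` are hypothetical names introduced for illustration, and only two of the original column names are reused.

```r
# Toy data frame standing in for master_df (hypothetical, two rows only).
toy_df <- data.frame(
  `MT_Never Married` = c(23.7, 39.5),
  `Total Population` = c(1941, 1757),
  DEPRESSION = c(25.9, 23.9),
  check.names = FALSE  # keep the spaces in the column names
)

# Lookup vector: element names are the original column names,
# element values are the replacement names.
rename_map <- c(
  "MT_Never Married" = "mt.nev.mar",
  "Total Population" = "tot.pop"
)

# Rename only the columns that appear in the lookup vector.
hits <- names(toy_df) %in% names(rename_map)
names(toy_df)[hits] <- unname(rename_map[names(toy_df)[hits]])

print(names(toy_df))
```

Columns that are not listed in the lookup vector, such as `DEPRESSION` here, keep their original names.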

1. Introduction

Using a range of datasets, our previous project focused on the relationship between health risk behaviors (lack of physical activity and of sleep), a health outcome (depression), and health status (poor mental health). In this research project, we investigate the influence of other health conditions, socioeconomic factors, and the economic impact of COVID-19 on mental health in the United States. We aim to determine which factors significantly affected mental health and well-being in the US during 2020. To do so, we analyze four datasets at the census tract level, using statistical techniques such as correlation analysis, linear regression, and decision trees. The results of this study have important implications for improving mental health outcomes and shaping public health policies and interventions: by identifying the factors that most strongly affect mental health, we can develop targeted interventions and programs to address them.

1.1. Data Sets Description

Our research project uses four datasets to investigate the influence of health conditions, socioeconomic factors, and the economic impact of COVID-19 on mental health outcomes in the United States. One of these datasets comes from our previous project (Project 1), while the other three were collected specifically for this study.

  1. The CDC_PLACE dataset measures health outcomes, prevention, health risk behaviors, and health status. It contains 13 health outcomes, nine prevention measures, four health risk behaviors, and three health status measures. The dataset was launched by the Centers for Disease Control and Prevention (CDC). In 2020, it provided small-area estimates for counties, places, census tracts, and ZIP Code Tabulation Areas across the United States. Each measure has a comprehensive definition that includes the background, significance, and limitations of the indicator, the data source, and the limitations of the data resources.

  2. The American Community Survey 2020 is another dataset used in our study. This ongoing survey provides detailed information on the population and housing in the US yearly. The ACS helps local officials, community leaders, and businesses understand the changes in their communities. Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics.

  3. Planning Database 2020 contains operational, demographic, and socioeconomic statistics from the 2010 Census at both the tract and block group levels.

  4. County Economic Impact Index (CEII) 2020 estimates the overall county-level economic activity change during the COVID-19 pandemic relative to 2020. CEII data also includes annualized monthly estimates of county-level value added for over 100 industries. Counties whose economic activity is dominated by industries experiencing rising unemployment can expect larger direct impacts on their local economies, particularly if those industries account for a large portion of the county's economic output.

These four datasets were analyzed to investigate the relationship between mental health outcomes and various socioeconomic and lifestyle factors, as well as the impact of COVID-19.

1.2. Research / SMART Questions

Our guiding questions for this project are the following:

SQ1: What factors do we observe to be highly associated with depression and poor mental health rates?

SQ2: Using those factors, how accurately can we infer depression and poor mental health rates in a given tract?

1.3. Data Preparation

Data cleaning and preparation is a crucial step in any data analysis project. In this step, we ensure that the data is in a format suitable for analysis and that any inconsistencies, errors, or missing values are corrected or removed. We merged the four data sources by CountyFIPS and tract-level geographic ID (GEOID), which allowed us to combine information from different sources through their shared identifiers. We also renamed variables to make them more meaningful and easier to understand. After merging and renaming, we removed all rows containing null values, because null values could interfere with our analysis and lead to inaccurate conclusions. We also removed outliers to reduce the influence of extreme values on our analysis. Once these cleaning steps were complete, we had a data frame with 12,444 observations of 57 variables, ready for further analysis and exploration.
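
The merge described above can be sketched with base R's `merge()`. This is an illustration under stated assumptions, not the project's actual preparation script: the three toy data frames (`health_df`, `acs_df`, `county_df`) and their values are hypothetical, and the sketch assumes that tract-level sources share a `GEOID` column while the county-level source shares a `CountyFIPS` column.

```r
# Hypothetical miniature versions of a tract-level health source,
# a tract-level socioeconomic source, and a county-level economic source.
health_df <- data.frame(
  GEOID      = c("01001020100", "01001020200", "01003010100"),
  CountyFIPS = c("01001", "01001", "01003"),
  DEPRESSION = c(25.9, 23.9, 24.0),
  stringsAsFactors = FALSE
)
acs_df <- data.frame(
  GEOID   = c("01001020100", "01001020200", "01003010100"),
  poverty = c(11.34, NA, 2.85),
  stringsAsFactors = FALSE
)
county_df <- data.frame(
  CountyFIPS  = c("01001", "01003"),
  index_apr20 = c(0.913, 0.880),  # made-up toy values
  stringsAsFactors = FALSE
)

# Tract-level sources are joined on the tract identifier ...
tract_df <- merge(health_df, acs_df, by = "GEOID")
# ... and the county-level source is attached through the county identifier.
merged_df <- merge(tract_df, county_df, by = "CountyFIPS")

# Drop every row that has a missing value in any column.
clean_df <- na.omit(merged_df)
print(clean_df)
```

Storing the identifiers as character strings rather than numbers preserves leading zeros, so codes such as `01001` still match across sources.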

2. EDA

In this section, we perform EDA on a dataset containing health and socioeconomic factors for different tracts in the United States. The purpose of the analysis is to investigate the relationship between these factors and two health outcomes: depression rate and poor mental health rate. We check the data quality, including missing values, outliers, and inconsistencies. We then examine the distributions of the variables and their relationships using visualizations such as histograms, scatterplots, and correlation matrices. The EDA reveals that some variables are highly correlated with depression rates and poor mental health rates. Overall, the EDA helps us to understand the data, identify potential issues, and guide our subsequent analysis. The insights gained from this analysis have important implications for public health policy and model building.

2.1. Data Structure & Data Types

The numeric data frame num_df is a table with 60,962 rows and 57 columns. The columns represent various variables, including health indicators, demographic characteristics, and economic indicators. All variables in the dataset are numeric except CountyFIPS, GEOID, StateAbbr, and CountyName, which are therefore not part of num_df. The dataset has also omitted some rows due to missing data, as indicated by the "na.action" attribute. Overall, the dataset is a comprehensive collection of health and demographic information for a large population.
str(num_df)
## tibble [60,962 × 57] (S3: tbl_df/tbl/data.frame)
##  $ ACCESS2                           : num [1:60962] 15 19.9 12.5 18 19.6 16.7 16.5 21.1 12.7 17.8 ...
##  $ BINGE                             : num [1:60962] 15.8 14.1 15.2 14.9 14.9 15.5 14.9 12.5 16.6 14.9 ...
##  $ CHECKUP                           : num [1:60962] 74.1 76.5 75.5 74.6 74 74.3 75.2 78.3 73.9 75.5 ...
##  $ DEPRESSION                        : num [1:60962] 25.9 23.9 24 26.9 28.5 26.2 25.9 24.5 24.6 25.9 ...
##  $ DIABETES                          : num [1:60962] 10.7 13.4 10.2 12.8 12.8 11.3 12.8 17.4 9.7 13.8 ...
##  $ LPA                               : num [1:60962] 26.4 32.1 23 31.1 33.4 28.2 29.8 37.2 22.8 29.6 ...
##  $ MHLTH                             : num [1:60962] 16.4 17.3 14 17.9 19.1 17 17 17.7 15.1 16.9 ...
##  $ OBESITY                           : num [1:60962] 37 43.9 33.2 41.1 41 37.9 40.3 46.1 35.7 36.2 ...
##  $ PHLTH                             : num [1:60962] 11.3 12.1 9.8 13.5 14.5 11.6 12.8 15.4 9.8 13 ...
##  $ SLEEP                             : num [1:60962] 36.9 43.4 33.4 39.5 39.8 38.1 39.2 43.8 35.8 37.6 ...
##  $ STROKE                            : num [1:60962] 3.1 3.7 3 3.8 4 3.3 3.8 5.3 2.7 4 ...
##  $ mt.nev.mar                        : num [1:60962] 23.7 39.5 20.9 27.5 36.9 ...
##  $ mt.now.mar                        : num [1:60962] 54.8 31.8 63.8 52.9 37 ...
##  $ MT_Divorces                       : num [1:60962] 11.97 18.56 10.09 9.62 18.61 ...
##  $ MT_Separated                      : num [1:60962] 1.39 2.62 0 1.17 1.61 ...
##  $ MT_Widowed                        : num [1:60962] 8.11 7.54 5.27 8.87 5.9 ...
##  $ ea.less.hs.deg                    : num [1:60962] 14.29 10.59 6.23 12.86 17.91 ...
##  $ ea.hs.deg                         : num [1:60962] 33.8 50.3 28.1 31.2 38.4 ...
##  $ ea.col.ass.deg                    : num [1:60962] 27 23.2 26.4 34.3 28 ...
##  $ ea.ba.deg                         : num [1:60962] 15.4 12.9 22.8 12.2 10.8 ...
##  $ ea.grad.prof.deg                  : num [1:60962] 9.4 2.97 16.43 9.5 4.93 ...
##  $ MI_Estimate                       : num [1:60962] 26504 26173 37529 20669 24181 ...
##  $ tot.pop                           : num [1:60962] 1941 1757 3539 3536 3562 ...
##  $ CT_<10                            : num [1:60962] 17.68 5.65 13.98 21.38 19.81 ...
##  $ CT_10-14                          : num [1:60962] 27.6 17.54 8.91 13.07 12.08 ...
##  $ CT_15-19                          : num [1:60962] 7.02 3.48 13.78 16.25 15.53 ...
##  $ CT_20-24                          : num [1:60962] 14.29 7.68 14.37 10.42 20.91 ...
##  $ CT_25-29                          : num [1:60962] 5.57 0.87 8.26 5.39 6.07 ...
##  $ CT_30-34                          : num [1:60962] 18.8 39.9 22.1 18.6 16.1 ...
##  $ CT_35-44                          : num [1:60962] 7.02 14.2 7.87 7.51 3.59 ...
##  $ CT_45-59                          : num [1:60962] 1.69 8.41 3.64 3.27 4.55 ...
##  $ CT_>60                            : num [1:60962] 0.363 2.319 7.087 4.152 1.38 ...
##  $ ES_Total_labor_force              : num [1:60962] 54.7 50.3 54.5 44.9 58.9 ...
##  $ ES_Civilian_labor_force           : num [1:60962] 54.7 49.4 54.2 43.5 58.3 ...
##  $ ES_Civilian_labor_force_employed  : num [1:60962] 53.5 47.4 52.9 42.3 53.5 ...
##  $ ES_Civilian_labor_force_unemployed: num [1:60962] 1.16 2 1.27 1.21 4.78 ...
##  $ ES_Armed_Forces                   : num [1:60962] 0 0.828 0.327 1.423 0.56 ...
##  $ ES_Not_in_labor_force             : num [1:60962] 45.3 49.7 45.5 55.1 41.1 ...
##  $ land                              : num [1:60962] 3.79 1.29 2.46 3.1 8.65 ...
##  $ urban                             : num [1:60962] 1594 2170 4386 3595 2505 ...
##  $ poverty                           : num [1:60962] 11.34 17.88 2.85 21.58 30.5 ...
##  $ no.ins                            : num [1:60962] 9.26 9.37 2.82 14.87 16.23 ...
##  $ disab                             : num [1:60962] 17.6 17.1 19.6 25.7 24.7 ...
##  $ no.comp                           : num [1:60962] 11.23 17.36 7.6 8.45 9.54 ...
##  $ broad&comp                        : num [1:60962] 80.9 79.2 86 87.6 85.1 ...
##  $ no.eng                            : num [1:60962] 1.83 0.56 0.98 0.68 0.99 0 0 0 0 0 ...
##  $ sing.mom                          : num [1:60962] 14.12 17.25 7.69 9.38 27.48 ...
##  $ live.alone                        : num [1:60962] 22.9 37.3 28.4 18.7 28.5 ...
##  $ pub.assist                        : num [1:60962] 0 1.11 1.22 0.68 4.51 2.16 1.27 0.45 1.47 0 ...
##  $ no.phone                          : num [1:60962] 2.88 2.36 2.5 0.45 0.78 0.69 0.49 4.91 0.88 0.71 ...
##  $ no.plumb                          : num [1:60962] 0 2 0.86 3.47 0 ...
##  $ married.kid                       : num [1:60962] 36 49.2 36.3 43.6 50.1 ...
##  $ hhd.no.comp                       : num [1:60962] 18.3 27.8 10.4 11.6 13.1 ...
##  $ hhd.only.phone                    : num [1:60962] 6.01 5.7 4.27 5.41 3.59 ...
##  $ hhd.no.int                        : num [1:60962] 26.5 30.5 17.5 17.2 17.7 ...
##  $ hhd.broad                         : num [1:60962] 59.7 55.1 74.6 65.4 66 ...
##  $ index_apr20                       : num [1:60962] 0.913 0.913 0.913 0.913 0.913 ...

2.2. Summary Statistics

The table displays summary statistics for the health and socioeconomic indicators in 2020, computed across census tracts. The mean values for several health indicators are high, such as depression (20.4%) and obesity (33.0%), indicating potential health concerns in the population. Education levels vary: on average, 12.2% have less than a high school degree and 28.8% have some college or an associate’s degree. The mean poverty rate is 15.3%, and on average 9.1% of the population does not have health insurance. Additionally, 12.5% of households do not have a computer (hhd.no.comp), and 28.1% of individuals live alone.

summary(num_df)
##     ACCESS2         BINGE         CHECKUP       DEPRESSION      DIABETES   
##  Min.   : 2.6   Min.   : 2.7   Min.   :49.7   Min.   : 8.5   Min.   : 0.6  
##  1st Qu.: 9.8   1st Qu.:14.7   1st Qu.:70.6   1st Qu.:17.8   1st Qu.: 8.5  
##  Median :13.3   Median :16.7   Median :74.8   Median :20.3   Median :10.3  
##  Mean   :15.5   Mean   :16.8   Mean   :74.0   Mean   :20.4   Mean   :10.9  
##  3rd Qu.:19.2   3rd Qu.:18.7   3rd Qu.:77.7   3rd Qu.:22.9   3rd Qu.:12.6  
##  Max.   :64.9   Max.   :36.6   Max.   :90.8   Max.   :37.8   Max.   :42.2  
##                                                                            
##       LPA           MHLTH         OBESITY         PHLTH          SLEEP     
##  Min.   : 7.8   Min.   : 6.1   Min.   :11.5   Min.   : 2.3   Min.   :19.8  
##  1st Qu.:18.7   1st Qu.:13.1   1st Qu.:28.0   1st Qu.: 8.4   1st Qu.:30.7  
##  Median :23.7   Median :15.0   Median :33.1   Median :10.3   Median :33.5  
##  Mean   :24.6   Mean   :15.1   Mean   :33.0   Mean   :10.8   Mean   :34.0  
##  3rd Qu.:29.5   3rd Qu.:16.9   3rd Qu.:37.7   3rd Qu.:12.7   3rd Qu.:36.7  
##  Max.   :63.7   Max.   :33.0   Max.   :63.9   Max.   :31.3   Max.   :54.4  
##                                                                            
##      STROKE        mt.nev.mar      mt.now.mar     MT_Divorces     MT_Separated 
##  Min.   : 0.20   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   : 0.0  
##  1st Qu.: 2.40   1st Qu.: 24.6   1st Qu.: 37.7   1st Qu.:  7.9   1st Qu.: 0.6  
##  Median : 3.00   Median : 31.5   Median : 48.2   Median : 10.7   Median : 1.5  
##  Mean   : 3.19   Mean   : 34.2   Mean   : 46.7   Mean   : 11.2   Mean   : 1.9  
##  3rd Qu.: 3.70   3rd Qu.: 41.4   3rd Qu.: 57.1   3rd Qu.: 13.9   3rd Qu.: 2.7  
##  Max.   :20.50   Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :41.8  
##                  NA's   :17      NA's   :17      NA's   :17      NA's   :17    
##    MT_Widowed   ea.less.hs.deg   ea.hs.deg    ea.col.ass.deg    ea.ba.deg    
##  Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.: 3.8   1st Qu.: 4.9   1st Qu.:19.6   1st Qu.: 23.4   1st Qu.: 10.8  
##  Median : 5.6   Median : 9.4   Median :28.4   Median : 29.1   Median : 17.2  
##  Mean   : 6.1   Mean   :12.2   Mean   :27.8   Mean   : 28.8   Mean   : 19.1  
##  3rd Qu.: 7.7   3rd Qu.:16.6   3rd Qu.:36.1   3rd Qu.: 34.5   3rd Qu.: 26.3  
##  Max.   :57.9   Max.   :83.0   Max.   :88.8   Max.   :100.0   Max.   :100.0  
##  NA's   :17     NA's   :25     NA's   :25     NA's   :25      NA's   :25     
##  ea.grad.prof.deg  MI_Estimate        tot.pop          CT_<10     
##  Min.   :  0.0    Min.   :  2499   Min.   :    0   Min.   :  0.0  
##  1st Qu.:  4.7    1st Qu.: 25068   1st Qu.: 2805   1st Qu.:  6.0  
##  Median :  8.7    Median : 31500   Median : 3868   Median : 10.5  
##  Mean   : 12.1    Mean   : 34406   Mean   : 4006   Mean   : 13.5  
##  3rd Qu.: 16.4    3rd Qu.: 40680   3rd Qu.: 5037   3rd Qu.: 17.4  
##  Max.   :100.0    Max.   :204750   Max.   :39373   Max.   :100.0  
##  NA's   :25       NA's   :71                       NA's   :93     
##     CT_10-14        CT_15-19        CT_20-24        CT_25-29       CT_30-34    
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   : 0.0   Min.   :  0.0  
##  1st Qu.:  7.9   1st Qu.:  9.8   1st Qu.:  9.2   1st Qu.: 3.2   1st Qu.:  8.2  
##  Median : 12.2   Median : 14.3   Median : 13.5   Median : 5.7   Median : 12.7  
##  Mean   : 13.6   Mean   : 15.2   Mean   : 14.1   Mean   : 6.3   Mean   : 13.3  
##  3rd Qu.: 17.9   3rd Qu.: 19.7   3rd Qu.: 18.3   3rd Qu.: 8.7   3rd Qu.: 17.6  
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :64.6   Max.   :100.0  
##  NA's   :93      NA's   :93      NA's   :93      NA's   :93     NA's   :93     
##     CT_35-44        CT_45-59         CT_>60      ES_Total_labor_force
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0       
##  1st Qu.:  3.1   1st Qu.:  3.4   1st Qu.:  3.5   1st Qu.: 57.3       
##  Median :  6.0   Median :  6.9   Median :  7.0   Median : 63.7       
##  Mean   :  6.8   Mean   :  8.0   Mean   :  9.3   Mean   : 62.7       
##  3rd Qu.:  9.6   3rd Qu.: 11.3   3rd Qu.: 12.6   3rd Qu.: 69.3       
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :100.0       
##  NA's   :93      NA's   :93      NA's   :93      NA's   :17          
##  ES_Civilian_labor_force ES_Civilian_labor_force_employed
##  Min.   :  0.0           Min.   :  0.0                   
##  1st Qu.: 57.0           1st Qu.: 53.1                   
##  Median : 63.4           Median : 59.9                   
##  Mean   : 62.2           Mean   : 58.7                   
##  3rd Qu.: 69.0           3rd Qu.: 65.6                   
##  Max.   :100.0           Max.   :100.0                   
##  NA's   :17              NA's   :17                      
##  ES_Civilian_labor_force_unemployed ES_Armed_Forces ES_Not_in_labor_force
##  Min.   : 0.0                       Min.   : 0.0    Min.   :  0.0        
##  1st Qu.: 1.8                       1st Qu.: 0.0    1st Qu.: 30.7        
##  Median : 3.0                       Median : 0.0    Median : 36.3        
##  Mean   : 3.6                       Mean   : 0.4    Mean   : 37.3        
##  3rd Qu.: 4.6                       3rd Qu.: 0.0    3rd Qu.: 42.7        
##  Max.   :33.4                       Max.   :99.4    Max.   :100.0        
##  NA's   :17                         NA's   :17      NA's   :17           
##       land           urban          poverty          no.ins         disab      
##  Min.   :    0   Min.   :    0   Min.   :  0.0   Min.   : 0.0   Min.   :  0.0  
##  1st Qu.:    1   1st Qu.:    3   1st Qu.:  6.6   1st Qu.: 4.1   1st Qu.:  9.3  
##  Median :    2   Median : 2989   Median : 12.1   Median : 7.3   Median : 12.6  
##  Mean   :   47   Mean   : 2793   Mean   : 15.3   Mean   : 9.1   Mean   : 13.5  
##  3rd Qu.:   10   3rd Qu.: 4368   3rd Qu.: 20.9   3rd Qu.:12.3   3rd Qu.: 16.7  
##  Max.   :85426   Max.   :31777   Max.   :100.0   Max.   :74.7   Max.   :100.0  
##  NA's   :18      NA's   :18      NA's   :146     NA's   :29     NA's   :111    
##     no.comp        broad&comp        no.eng         sing.mom    
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:  3.4   1st Qu.: 76.4   1st Qu.:  0.0   1st Qu.:  7.3  
##  Median :  6.8   Median : 85.0   Median :  1.6   Median : 11.3  
##  Mean   :  8.5   Mean   : 82.6   Mean   :  4.6   Mean   : 13.3  
##  3rd Qu.: 11.6   3rd Qu.: 91.6   3rd Qu.:  5.4   3rd Qu.: 17.3  
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :100.0  
##  NA's   :164     NA's   :164     NA's   :164     NA's   :164    
##    live.alone      pub.assist      no.phone        no.plumb      married.kid   
##  Min.   :  0.0   Min.   : 0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.: 20.4   1st Qu.: 0.8   1st Qu.:  0.8   1st Qu.:  0.0   1st Qu.: 33.4  
##  Median : 27.1   Median : 1.8   Median :  1.7   Median :  0.9   Median : 40.7  
##  Mean   : 28.1   Mean   : 2.7   Mean   :  2.2   Mean   :  2.4   Mean   : 41.2  
##  3rd Qu.: 34.6   3rd Qu.: 3.6   3rd Qu.:  3.1   3rd Qu.:  3.0   3rd Qu.: 48.8  
##  Max.   :100.0   Max.   :53.5   Max.   :100.0   Max.   :100.0   Max.   :100.0  
##  NA's   :164     NA's   :171    NA's   :164     NA's   :163     NA's   :213    
##   hhd.no.comp    hhd.only.phone    hhd.no.int      hhd.broad    
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:  5.9   1st Qu.:  2.1   1st Qu.:  8.7   1st Qu.: 54.1  
##  Median : 10.9   Median :  4.7   Median : 15.6   Median : 68.0  
##  Mean   : 12.5   Mean   :  5.9   Mean   : 17.4   Mean   : 65.7  
##  3rd Qu.: 17.2   3rd Qu.:  8.4   3rd Qu.: 23.8   3rd Qu.: 79.4  
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :100.0  
##  NA's   :164     NA's   :164     NA's   :164     NA's   :164    
##   index_apr20   
##  Min.   :0.360  
##  1st Qu.:0.855  
##  Median :0.885  
##  Mean   :0.883  
##  3rd Qu.:0.916  
##  Max.   :1.194  
## 

2.3. Handle Null Values

Checking for null values with the table below confirms that the data set contains some. Because they are few relative to the overall data (at most 213 in any column, out of 60,962 rows), we remove them all with the “na.omit” function, which drops every row that has a missing value in any column. The resulting dataset contains only rows with complete data.

colSums(is.na(num_df))
##                            ACCESS2                              BINGE 
##                                  0                                  0 
##                            CHECKUP                         DEPRESSION 
##                                  0                                  0 
##                           DIABETES                                LPA 
##                                  0                                  0 
##                              MHLTH                            OBESITY 
##                                  0                                  0 
##                              PHLTH                              SLEEP 
##                                  0                                  0 
##                             STROKE                         mt.nev.mar 
##                                  0                                 17 
##                         mt.now.mar                        MT_Divorces 
##                                 17                                 17 
##                       MT_Separated                         MT_Widowed 
##                                 17                                 17 
##                     ea.less.hs.deg                          ea.hs.deg 
##                                 25                                 25 
##                     ea.col.ass.deg                          ea.ba.deg 
##                                 25                                 25 
##                   ea.grad.prof.deg                        MI_Estimate 
##                                 25                                 71 
##                            tot.pop                             CT_<10 
##                                  0                                 93 
##                           CT_10-14                           CT_15-19 
##                                 93                                 93 
##                           CT_20-24                           CT_25-29 
##                                 93                                 93 
##                           CT_30-34                           CT_35-44 
##                                 93                                 93 
##                           CT_45-59                             CT_>60 
##                                 93                                 93 
##               ES_Total_labor_force            ES_Civilian_labor_force 
##                                 17                                 17 
##   ES_Civilian_labor_force_employed ES_Civilian_labor_force_unemployed 
##                                 17                                 17 
##                    ES_Armed_Forces              ES_Not_in_labor_force 
##                                 17                                 17 
##                               land                              urban 
##                                 18                                 18 
##                            poverty                             no.ins 
##                                146                                 29 
##                              disab                            no.comp 
##                                111                                164 
##                         broad&comp                             no.eng 
##                                164                                164 
##                           sing.mom                         live.alone 
##                                164                                164 
##                         pub.assist                           no.phone 
##                                171                                164 
##                           no.plumb                        married.kid 
##                                163                                213 
##                        hhd.no.comp                     hhd.only.phone 
##                                164                                164 
##                         hhd.no.int                          hhd.broad 
##                                164                                164 
##                        index_apr20 
##                                  0
num_df <- na.omit(num_df)
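
To back up the claim that the missing values are few, the share of affected rows can be computed before dropping them. The sketch below uses a small toy data frame rather than `num_df`; `toy_df`, `incomplete`, `n_dropped`, and `share_dropped` are hypothetical names introduced for illustration.

```r
# Toy data frame with two incomplete rows (hypothetical values).
toy_df <- data.frame(
  DEPRESSION = c(25.9, 23.9, 24.0, 26.9, 28.5),
  poverty    = c(11.34, NA, 2.85, 21.58, 30.5),
  no.ins     = c(9.26, 9.37, NA, 14.87, 16.23)
)

# A row is incomplete if it has an NA in any column;
# these are exactly the rows that na.omit() removes.
incomplete    <- !complete.cases(toy_df)
n_dropped     <- sum(incomplete)
share_dropped <- mean(incomplete)

cat(sprintf("Rows dropped: %d of %d (%.1f%%)\n",
            n_dropped, nrow(toy_df), 100 * share_dropped))

clean_df <- na.omit(toy_df)
```

Counting incomplete rows, rather than summing the per-column NA counts, avoids double counting rows that are missing several variables at once.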

2.4. Check for and remove outliers

Next, we handle outliers in the numerical data. We apply the outlierKD2 function from the “ezids” package to detect and remove outliers, looping through all columns and calling outlierKD2 on each one. Within the loop, we remove any resulting NAs and drop the last two columns, which are added by outlierKD2, after each call; one further trailing column is dropped after the loop. Handling outliers is essential in creating an accurate model because outliers can significantly impact the results, especially in regression models, and can lead to poor performance and inaccurate predictions.

new_num_df <- outlierKD2(num_df, num_df[[1]], rm=TRUE, boxplt=F, histogram=F,qqplt=F)
new_num_df <- new_num_df[,1:(ncol(new_num_df)-2)]

# loop through all columns
for (col_name in colnames(num_df)[2:ncol(num_df)]) {
  # remove outliers
  new_num_df <- outlierKD2(new_num_df, new_num_df[[col_name]], rm=T, boxplt=F, histogram=F,qqplt=F)
  new_num_df <- na.omit(new_num_df)
  new_num_df <- new_num_df[,1:(ncol(new_num_df)-2)]
}
new_num_df <- new_num_df[,1:(ncol(new_num_df)-1)]
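
For readers without the “ezids” package, the general idea of boxplot-based outlier removal can be sketched in base R. This is not the outlierKD2 implementation and is not guaranteed to reproduce its results: `remove_outliers_iqr` is a hypothetical helper, and unlike the loop above it flags outliers in every column against the full data in a single pass instead of re-evaluating after each column.

```r
# Hypothetical helper: drop every row that holds a boxplot outlier
# (a value beyond 1.5 times the hinge spread) in at least one column.
remove_outliers_iqr <- function(df) {
  keep <- rep(TRUE, nrow(df))
  for (col_name in names(df)) {
    x <- df[[col_name]]
    out_vals <- boxplot.stats(x)$out  # values flagged by the boxplot rule
    keep <- keep & !(x %in% out_vals)
  }
  df[keep, , drop = FALSE]
}

# Toy data with one extreme value per column (hypothetical values).
toy_df <- data.frame(
  a = c(10, 11, 12, 13, 14, 15, 100),
  b = c(1, 2, 3, 4, 5, -50, 6)
)

clean_df <- remove_outliers_iqr(toy_df)
print(clean_df)
```

Because a row is dropped when any single column is extreme, applying such a rule across 57 variables can remove a large share of the data, which is worth keeping in mind when comparing row counts before and after cleaning.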

2.5. Boxplot Analysis

The box plots show the distribution of various health, demographic, and socioeconomic factors. The plots suggest that obesity, diabetes, and depression rates are relatively high, while education levels and median household income are moderate. After outlier removal, the distributions of the variables look roughly symmetric with few extreme values, although box plots alone cannot confirm normality.

par(mfrow=c(8, 8))
par(mar=c(1, 1, 1, 1))
for (i in 1:ncol(new_num_df)) {
  boxplot(new_num_df[[i]], main=names(new_num_df)[i])
}

2.6. Scatter plot

We create scatter plots to visualize the relationship between depression and the other variables, and between poor mental health and the other variables. Each scatter plot shows how the two variables are jointly distributed, and a fitted line shows the trend of their relationship. The data used in the plots come from survey-based sources covering various demographic, socioeconomic, and health-related factors.

2.6.1. Scatter plot for depression

Based on the scatter plot analysis, depression appears to be positively associated with “OBESITY,” “PHLTH,” and “ea.hs.deg.” On the other hand, there seems to be a negative association between depression and “MI_Estimate,” “ea.grad.prof.deg,” “CT_>60,” and “hhd.broad,” indicating that as these variables increase, the level of depression decreases. The strength of these associations varies: “OBESITY,” “PHLTH,” and “MI_Estimate” show relatively stronger associations with depression (positive for the first two, negative for the third), while “ea.hs.deg” and “ea.grad.prof.deg” show weaker ones.

theme_set(theme_pubr(base_size = 3.5))
plot_list <- list()
for (col in names(new_num_df)) {
  if (col != "DEPRESSION") {
    plot_data <- data.frame(x = new_num_df[[col]], y = new_num_df$DEPRESSION)
    plot <- ggplot(plot_data, aes(x = x, y = y)) + 
      geom_point(size=0.3) +
      geom_smooth(method = "lm", se = FALSE) +
      xlab(col) +
      theme(legend.position = "none")
    plot_list[[col]] <- plot
  }
}
ggarrange(plotlist = plot_list, ncol = 7, nrow = 8)

2.6.2. Scatter plot for poor mental health

The scatter plots with a fitted line for poor mental health (MHLTH) and the above highly correlated variables show that MHLTH positively correlates with PHLTH, OBESITY, and poverty. The scatter plot for PHLTH and MHLTH shows a clear positive linear relationship.

# Scatter plots with fitted lines for MHLTH against each of the other variables
theme_set(theme_pubr(base_size = 3.5))
plot_list <- list()
for (col in names(new_num_df)) {
  if (col != "MHLTH") {
    plot_data <- data.frame(x = new_num_df[[col]], y = new_num_df$MHLTH)
    plot <- ggplot(plot_data, aes(x = x, y = y)) + 
      geom_point(size=0.3) +
      geom_smooth(method = "lm", se = FALSE) +
      xlab(col) +
      theme(legend.position = "none")
    plot_list[[col]] <- plot
  }
}
ggarrange(plotlist = plot_list, ncol = 7, nrow = 8)

2.7. Distribution of dependent variable

Next, we examine the distribution of the dependent variables, depression rate and poor mental health rate, to decide which model is appropriate for further analysis. The histograms of both rates are bell-shaped, and the points in both Q-Q plots fall close to a straight line, which suggests that both variables are approximately normally distributed. Strictly speaking, linear regression assumes normally distributed residuals rather than a normally distributed response, but an approximately normal response gives no reason to rule out a linear regression model, so we use one for further analysis. Linear regression lets us explore the relationship between the dependent and independent variables and can provide valuable insight into potential risk factors for depression and poor mental health, supporting the development of targeted interventions to improve mental health outcomes.

# Histograms
hist(new_num_df$DEPRESSION, main = "Distribution of Tract-level Depression Rates")

hist(new_num_df$MHLTH, main = "Distribution of Tract-level Poor Mental Health Rates")

# QQ plots for the distribution of tract-level depression rates and tract-level poor mental health rates.

qqnorm(new_num_df$DEPRESSION, main = "Distribution of Tract-level Depression Rates")
qqline(new_num_df$DEPRESSION)

qqnorm(new_num_df$MHLTH, main = "Distribution of Tract-level Poor Mental Health Rates")
qqline(new_num_df$MHLTH)
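Because the normality assumption in linear regression concerns the residuals, the same Q-Q check can be repeated on a fitted model. The following is a minimal sketch on simulated data; the names x, y, toy_fit, and toy_resid are illustrative and do not come from the project.

```r
# Simulated stand-in data; not the project's tract-level data
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)

toy_fit <- lm(y ~ x)
toy_resid <- residuals(toy_fit)

# Q-Q plot of the residuals: points near the line suggest approximate normality
qqnorm(toy_resid, main = "Q-Q Plot of Residuals (simulated data)")
qqline(toy_resid)

# Residuals of a least-squares fit with an intercept average to zero by construction
mean_resid <- mean(toy_resid)
```

With the project's objects, the same two plotting calls could be applied to residuals(depression_model_1) once that model has been fitted.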

2.8. Correlations Test

We begin selecting variables for our multiple linear regressions by generating correlation matrices for all of our health and socioeconomic variables. This lets us keep only variables that are at least moderately associated with depression or poor mental health rates. We keep a variable when the absolute value of its Pearson correlation coefficient with the outcome exceeds a cutoff: 0.35 for depression and 0.5 for poor mental health. This reveals numerous moderate associations. Notably, the associations with poor mental health are stronger than the associations with depression: the variables retained for depression all have |r| above 0.35, while those retained for poor mental health all have |r| above 0.5. Both health conditions are associated with some combination of health and socioeconomic variables.

# Create a correlation matrix
cor_matrix <- cor(new_num_df)

# Create two lists with the names of variables highly correlated with each outcome
# (|r| > 0.35 for DEPRESSION, |r| > 0.5 for MHLTH)
high_dep_cor_list <- names(which(cor_matrix["DEPRESSION",] > 0.35 | cor_matrix["DEPRESSION",] < -0.35))
high_mhlth_cor_list <- names(which(cor_matrix["MHLTH",] > 0.5 | cor_matrix["MHLTH",] < -0.5))

high_dep_cor_list <- high_dep_cor_list[high_dep_cor_list != "MHLTH"]
high_mhlth_cor_list <- high_mhlth_cor_list[high_mhlth_cor_list != "DEPRESSION"]

# Create two correlation matrices from the new_num_df columns in the above lists
high_dep_cor_mat <- cor(new_num_df[high_dep_cor_list])
high_mhlth_cor_mat <- cor(new_num_df[high_mhlth_cor_list])

# Plot the correlation matrices above
# corrplot() takes its palette through `col`; `mar` leaves room for the title
blue_orange <- colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200)

corrplot(high_dep_cor_mat, type = "lower", col = blue_orange, tl.col = "black",
         title = "Highly Correlated Variables with DEPRESSION", mar = c(0, 0, 2, 0))

corrplot(high_mhlth_cor_mat, type = "lower", col = blue_orange, tl.col = "black",
         title = "Highly Correlated Variables with MHLTH", mar = c(0, 0, 2, 0))

my_col <- colorRampPalette(c("red", "white", "blue"))(30)

corrplot(high_dep_cor_mat, method = "number", type = "lower", order = "hclust", diag = FALSE, tl.col = "black", tl.cex = 0.8, cl.cex = 0.8, addCoef.col = "black", col = my_col, main = "", mar = c(0,0,0,0))

corrplot(high_mhlth_cor_mat, method = "number", type = "lower", order = "hclust", diag = FALSE, tl.col = "black", tl.cex = 0.8, cl.cex = 0.8, addCoef.col = "black", col = my_col, main = "", mar = c(0,0,0,0))
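The threshold-based selection used in this section can be wrapped in a small helper. The sketch below is one possible formulation on toy data; select_by_correlation and the toy columns outcome, a, and b are hypothetical names, not part of the project code.

```r
# Hypothetical helper: names of columns whose |r| with `target` exceeds `threshold`
select_by_correlation <- function(df, target, threshold) {
  cors <- cor(df)[target, ]
  keep <- names(cors)[abs(cors) > threshold]
  setdiff(keep, target)  # drop the target itself (its self-correlation is 1)
}

# Toy data: `a` tracks the outcome closely, `b` is independent noise
set.seed(42)
toy_df <- data.frame(outcome = rnorm(200))
toy_df$a <- toy_df$outcome + rnorm(200, sd = 0.3)
toy_df$b <- rnorm(200)

selected <- select_by_correlation(toy_df, "outcome", 0.35)
```

Applied to the project's data, the equivalent calls would be select_by_correlation(new_num_df, "DEPRESSION", 0.35) and select_by_correlation(new_num_df, "MHLTH", 0.5); the other outcome would still need to be removed from each list, as the code above does.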

2.9. VIF Test for Multicollinearity

Next, we check the selected variables for multicollinearity by using a Variance Inflation Factor (VIF) test. If multiple independent variables have a multicollinear relationship, we cannot use them for our multiple regressions. Fortunately, all of our selected independent variables have VIF values below 5 (except for LPA with MHLTH, VIF value of 5.02). This means that there is little multicollinearity between our independent variables, and we can use them all in our subsequent multiple regressions.

calculate_VIF <- function(data, target_col) {
  # VIF depends only on the predictors, so a single fit of the target
  # on all remaining columns yields one VIF per predictor.
  fit <- lm(reformulate(".", response = target_col), data = data)
  vif_values <- vif(fit)
  data.frame(
    Feature = gsub("`", "", names(vif_values)),
    VIF = round(unname(vif_values), 2)
  )
}

depression_VIF <- calculate_VIF(new_num_df[high_dep_cor_list], "DEPRESSION")

mhlth_VIF <- calculate_VIF(new_num_df[high_mhlth_cor_list], "MHLTH")

depression_VIF
##            Feature  VIF
## 1          OBESITY 1.97
## 2            PHLTH 2.87
## 3        ea.hs.deg 2.57
## 4 ea.grad.prof.deg 2.71
## 5      MI_Estimate 2.65
## 6           CT_>60 1.14
## 7        hhd.broad 1.98
mhlth_VIF
##            Feature  VIF
## 1              LPA 5.02
## 2          OBESITY 2.07
## 3            PHLTH 4.33
## 4            SLEEP 2.11
## 5        ea.hs.deg 3.74
## 6        ea.ba.deg 3.68
## 7 ea.grad.prof.deg 2.70
## 8      MI_Estimate 2.83
## 9          poverty 1.78
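The VIF of a predictor equals 1 / (1 - R-squared), where R-squared comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on toy data using base R only; manual_vif and the columns x1, x2, and x3 are hypothetical names introduced for illustration.

```r
# Hypothetical helper: VIF of each predictor from its auxiliary regression R-squared
manual_vif <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    aux <- lm(reformulate(others, response = v), data = X)
    1 / (1 - summary(aux)$r.squared)
  })
}

set.seed(7)
toy_X <- data.frame(x1 = rnorm(300))
toy_X$x2 <- 0.8 * toy_X$x1 + rnorm(300, sd = 0.6)  # built to correlate with x1
toy_X$x3 <- rnorm(300)                              # generated independently

toy_vif <- manual_vif(toy_X)

# Cross-check: the same values sit on the diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(toy_X)))
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values mean more of its variance is explained by them.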

2.10. Feature Selection

After this initial screening, we are left with seven variables for our depression regression [Obesity (%), Poor Physical health >= 14 days (%), (Last Academic Record) High School Graduate (%), (Academic Record) Graduate or professional degree (%), Median Income (US$), Commute Time >= 60 mins (%), and Broadband access (%)] and nine variables for our poor mental health regression [Lack of Physical Activity (%), Obesity (%), Poor Physical health >= 14 days (%), Poor Sleep >= 14 days (%), (Last Academic Record) High School Graduate (%), (Academic Record) Bachelor's degree (%), (Academic Record) Graduate or professional degree (%), Median Income (US$), and Below Poverty Level (%)].

dep_features_list <- high_dep_cor_list[high_dep_cor_list != "DEPRESSION"]
mhlth_features_list <- high_mhlth_cor_list[high_mhlth_cor_list != "MHLTH"]

dep_features_list
## [1] "OBESITY"          "PHLTH"            "ea.hs.deg"        "ea.grad.prof.deg"
## [5] "MI_Estimate"      "CT_>60"           "hhd.broad"
mhlth_features_list
## [1] "LPA"              "OBESITY"          "PHLTH"            "SLEEP"           
## [5] "ea.hs.deg"        "ea.ba.deg"        "ea.grad.prof.deg" "MI_Estimate"     
## [9] "poverty"

3. Model Building

3.1. Training and Test Sets

At this point, we split our initial data set into training and testing data sets. We “train” our multiple linear regressions and regression trees on this training set. We then test whether our models are over or underfit by feeding the testing data set into them and comparing the results to the output we got from feeding the initial training data into them.

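library(caret)
set.seed(123)

# Assign each row a random fold from 1 to 10
fold <- floor(runif(nrow(new_num_df), 1, 11))
new_num_df$fold <- fold

# Hold out fold 1 (roughly 10% of rows) as the test set; train on the rest
test.set <- new_num_df[new_num_df$fold == 1, ]
train.set <- new_num_df[new_num_df$fold != 1, ]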
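To see what this fold assignment does, the sketch below repeats it on a toy data frame. The row count and the names toy_df, toy_test, and toy_train are illustrative assumptions rather than the project's values.

```r
set.seed(123)
n <- 1000  # illustrative row count, not the project's
toy_df <- data.frame(id = seq_len(n), value = rnorm(n))

# runif(n, 1, 11) draws from the open interval (1, 11); floor() maps draws to folds 1..10
toy_df$fold <- floor(runif(n, 1, 11))

toy_test  <- toy_df[toy_df$fold == 1, ]
toy_train <- toy_df[toy_df$fold != 1, ]

# Share of rows held out; expected to be near 0.10 but not exactly 0.10
test_share <- nrow(toy_test) / n
```

Because the folds are drawn at random, the test set is only approximately one tenth of the data, and every row lands in exactly one of the two sets.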

3.2. Multiple Linear Regression

Model 1: With Highly correlated variables (Health Factors + Socioeconomic Factors)

# multiple linear regression model for depression
depression_model_1 <- lm(DEPRESSION ~ OBESITY + PHLTH + ea.hs.deg + ea.grad.prof.deg + MI_Estimate + `CT_>60` + hhd.broad, data = train.set)

# multiple linear regression model for poor mental health
mhlth_model_1 <- lm(MHLTH ~ LPA + OBESITY + PHLTH + SLEEP + ea.hs.deg + ea.ba.deg + ea.grad.prof.deg + MI_Estimate + poverty, data = train.set)

summary(depression_model_1)
summary(mhlth_model_1)

plot(depression_model_1)

plot(mhlth_model_1)

3.2.1. Model Comparison (ANOVA)

We experiment with three different multiple linear regressions and compare their performance using Analysis of Variance (ANOVA) testing. The first (Model 1) predicts depression and poor mental health rates using all of the highly correlated variables for each condition. The second (Model 2) uses only the cluster of highly correlated health variables. Finally, the third (Model 3) uses only the cluster of highly correlated socioeconomic variables.

As expected, Model 1 performs best overall because it incorporates the largest number of highly correlated variables. It has a lower residual sum of squares (RSS) than the other models, meaning it leaves the least unexplained variation in the dependent variable. In other words, Model 1 explains more of the variation in depression and poor mental health rates than the other two models.

Nevertheless, we are more interested in comparing Model 2 to Model 3. Model 2 (the health variables model) performs slightly better than Model 3 (the socioeconomic variables model): its RSS is lower for both outcomes (77,498 versus 77,847 for depression and 15,343 versus 16,325 for poor mental health), and it uses fewer predictors, as its larger residual degrees of freedom show. Achieving similar or better accuracy with fewer variables suggests that the health variables include some key predictors. The ANOVA tables report that the difference between Model 1 and Model 2 is statistically significant at the 1% level (p-value less than 0.01). Models 2 and 3 are not nested in one another, so the tables give no F-test for that pair and we compare them by RSS alone. In sum, although it is better to incorporate both the highly correlated health variables and the highly correlated socioeconomic variables, the health variables seem more predictive than the socioeconomic variables.

# Compare the three fitted models with ANOVA (F-tests between successive models)
depression_anova_model <- anova(depression_model_1, depression_model_2, depression_model_3)
mhlth_anova_model <- anova(mhlth_model_1, mhlth_model_2, mhlth_model_3)


xkabledply(depression_anova_model)
Model: Res.Df ~ RSS + Df + Sum of Sq + F + Pr(>F)
Res.Df    RSS      Df    Sum of Sq    F      Pr(>F)
11204     69606    NA    NA           NA     NA
11209     77498    -5    -7893        254    0
11206     77847    3     -349         NA     NA

xkabledply(mhlth_anova_model)
Model: Res.Df ~ RSS + Df + Sum of Sq + F + Pr(>F)
Res.Df    RSS      Df    Sum of Sq    F      Pr(>F)
11202     11872    NA    NA           NA     NA
11207     15343    -5    -3471        655    0
11206     16325    1     -982         NA     NA

# health.socioecon.dep.anova <- anova(depression_model_2, depression_model_3)
# health.socioecon.mhlth.anova <- anova(mhlth_model_2, mhlth_model_3)
# 
# xkabledply(health.socioecon.dep.anova)
# xkabledply(health.socioecon.mhlth.anova)
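The F values in the ANOVA tables above can be reproduced by hand from the residual sums of squares and residual degrees of freedom. The sketch below does this for the Model 1 versus Model 2 rows; nested_f is a hypothetical helper name, and the inputs are the rounded figures printed in the tables.

```r
# Hypothetical helper: partial F statistic for a reduced model nested in a full model
nested_f <- function(rss_full, df_full, rss_reduced, df_reduced) {
  numerator   <- (rss_reduced - rss_full) / (df_reduced - df_full)
  denominator <- rss_full / df_full
  numerator / denominator
}

# Values copied from the ANOVA tables above (rows 1 and 2 of each table)
f_depression <- nested_f(69606, 11204, 77498, 11209)
f_mhlth      <- nested_f(11872, 11202, 15343, 11207)

# Upper-tail p-value for the depression comparison (5 and 11204 degrees of freedom)
p_depression <- pf(f_depression, df1 = 5, df2 = 11204, lower.tail = FALSE)
```

The numerator is the extra residual variation per dropped predictor, and the denominator is the full model's residual variance, so a large ratio means the dropped predictors carried real explanatory power.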

3.2.2. Stepwise Selection (Tuning Multiple Linear Regression Model)

In the initial model, we included all variables with high correlation coefficients. Stepwise selection then identified a smaller set of predictor variables that remain highly significant in predicting depression and poor mental health rates. We ran the selection in both directions with step(), setting the penalty per parameter to k = log(n), where n is the number of training rows. This corresponds to the Bayesian information criterion (BIC), which for a sample of this size penalizes each additional parameter more heavily than AIC (k = 2). Penalizing the number of parameters helps to prevent overfitting and avoid overly complex models.

# Perform stepwise selection; k = log(n) applies the BIC penalty (the default k = 2 would give AIC)
depression_stepwise_model <- step(depression_model_1, direction = "both", trace = 0, k = log(nrow(train.set)))

mhlth_stepwise_model <- step(mhlth_model_1, direction = "both", trace = 0, k = log(nrow(train.set)))
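The k argument of step() sets the penalty per parameter: k = 2 gives AIC-based selection and k = log(n) gives BIC-based selection. The sketch below contrasts the two on simulated data; toy_df, full_fit, aic_fit, and bic_fit are illustrative names, and the data are constructed so that only x1 matters.

```r
set.seed(10)
n <- 200
toy_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy_df$y <- 1 + 2 * toy_df$x1 + rnorm(n)  # only x1 enters the true model

full_fit <- lm(y ~ x1 + x2 + x3, data = toy_df)

# k = 2 gives AIC-based selection; k = log(n) gives BIC-based selection
aic_fit <- step(full_fit, direction = "both", trace = 0, k = 2)
bic_fit <- step(full_fit, direction = "both", trace = 0, k = log(n))

# For the same model, the two criteria differ only by the per-parameter penalty
edf <- length(coef(full_fit))
penalty_gap <- extractAIC(full_fit, k = log(n))[2] - extractAIC(full_fit, k = 2)[2]
```

Comparing names(coef(aic_fit)) with names(coef(bic_fit)) shows which terms each criterion keeps; the heavier BIC penalty tends to favor smaller models.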

3.2.3. Best Model vs. Stepwise Model

The stepwise model had fewer variables and performed similarly to the initial model when compared using an ANOVA test. The residual sums of squares (RSS) were almost identical for both the depression and poor mental health models, and the F-tests found no significant difference (p = 0.118 for depression and p = 0.282 for poor mental health). Therefore, we chose the stepwise model over the initial model because it is more parsimonious.

dep_anova_model_2 <- anova(depression_model_1, depression_stepwise_model)
mhlth_anova_model_2 <- anova(mhlth_model_1, mhlth_stepwise_model)

xkabledply(dep_anova_model_2)
Model: Res.Df ~ RSS + Df + Sum of Sq + F + Pr(>F)
Res.Df    RSS      Df    Sum of Sq    F       Pr(>F)
11204     69606    NA    NA           NA      NA
11206     69632    -2    -26.5        2.13    0.118

xkabledply(mhlth_anova_model_2)
Model: Res.Df ~ RSS + Df + Sum of Sq + F + Pr(>F)
Res.Df    RSS      Df    Sum of Sq    F       Pr(>F)
11202     11872    NA    NA           NA      NA
11203     11873    -1    -1.23        1.16    0.282

3.2.4. Best Model Performance

This section summarizes the models we consider the best performing based on the model selection so far.

For depression rates, our model identified five significant predictor variables: “OBESITY Rate”, “Poor physical health”, “high school graduate as highest level educational attainment”, “CT_>60”, and “percent of households interacting with the internet”. The model explained 37.6% of the variability in depression levels, which is a moderate amount.

For poor mental health rates, our model identified seven significant predictor variables: “LPA”, “Obesity”, “PHLTH”, “Sleep”, “ea.graduate”, “MI_income”, and “poverty”. The model explained 67.6% of the variability in poor mental health levels, indicating a strong relationship between the predictor variables and poor mental health.

Overall, our findings suggest that demographic, socio-economic, and health-related variables play a significant role in predicting depression and poor mental health rates in the United States. Our models can serve as a basis for further research on this topic and can assist policymakers in identifying high-risk populations and designing targeted interventions.

# Evaluate the selected (stepwise) depression model
dep_predictions <- predict(depression_stepwise_model, newdata = test.set)

# R-squared of the fitted model (computed on the training set)
dep_r_squared <- summary(depression_stepwise_model)$r.squared

# mean squared error (MSE) on the held-out test set
dep_mse <- mean((test.set$DEPRESSION - dep_predictions)^2)

# print the results
cat("R-squared:", dep_r_squared, "\n")
cat("MSE:", dep_mse, "\n")

# Evaluate the selected (stepwise) poor mental health model
mhlth_predictions <- predict(mhlth_stepwise_model, newdata = test.set)

# R-squared of the fitted model (computed on the training set)
mhlth_r_squared <- summary(mhlth_stepwise_model)$r.squared

# mean squared error (MSE) on the held-out test set
mhlth_mse <- mean((test.set$MHLTH - mhlth_predictions)^2)

# print the results
cat("R-squared:", mhlth_r_squared, "\n")
cat("MSE:", mhlth_mse, "\n")
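To compare models on equal footing, R-squared and MSE can both be computed from held-out predictions. The sketch below shows one way to do that; eval_metrics is a hypothetical helper, and the four data points are made up purely to make the arithmetic easy to follow.

```r
# Hypothetical helper: out-of-sample metrics from observed and predicted values
eval_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)        # sum of squared prediction errors
  sst <- sum((actual - mean(actual))^2)     # total variation around the mean
  mse <- sse / length(actual)
  c(R2 = 1 - sse / sst, MSE = mse, RMSE = sqrt(mse))
}

# Tiny worked example with made-up numbers
toy_actual    <- c(1, 2, 3, 4)
toy_predicted <- c(1.5, 2, 2.5, 4)
toy_metrics   <- eval_metrics(toy_actual, toy_predicted)
```

With the project's objects this would be called as eval_metrics(test.set$DEPRESSION, dep_predictions) and eval_metrics(test.set$MHLTH, mhlth_predictions), which yields a test-set R-squared rather than the training-set value reported by summary().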

3.3. Regression Tree Model

We then generate our regression trees using our training data set. Because we do not preemptively select highly correlated variables (as we did when building the multiple regression models), we see a greater variety of predictive independent variables emerge. These include the economic index showing the effect of COVID, binge drinking (%), no proficient English speakers in households (%), never married (%), married with kid (%), attainment of graduate or professional degree (%), and having gotten a recent health checkup (%).

# Building a Regression Tree

# Step 1. Use recursive binary splitting to grow a large tree of depression on the training data
train.dep.tree <- rpart(DEPRESSION ~  ACCESS2 + BINGE + CHECKUP + DIABETES + LPA + OBESITY + PHLTH + SLEEP + STROKE + mt.nev.mar + mt.now.mar + MT_Divorces + MT_Separated + MT_Widowed + ea.less.hs.deg + ea.hs.deg + ea.col.ass.deg + ea.ba.deg + ea.grad.prof.deg + MI_Estimate + tot.pop + `CT_<10` + `CT_10-14` + `CT_15-19` + `CT_20-24` + `CT_25-29` + `CT_30-34` + `CT_35-44` + `CT_45-59` + `CT_>60` + ES_Total_labor_force + ES_Civilian_labor_force + ES_Civilian_labor_force_employed + ES_Civilian_labor_force_unemployed + ES_Armed_Forces + ES_Not_in_labor_force + land + urban + poverty + no.ins + disab + no.comp + `broad&comp` + no.eng + sing.mom + live.alone + pub.assist + no.phone + no.plumb + married.kid + hhd.no.comp + hhd.only.phone + hhd.no.int + hhd.broad + index_apr20, data = train.set, method = "anova")

# Grow large tree of mhlth on the training data 
train.mhlth.tree <- rpart(MHLTH ~ ACCESS2 + BINGE + CHECKUP + DIABETES + LPA + OBESITY + PHLTH + SLEEP + STROKE + mt.nev.mar + mt.now.mar + MT_Divorces + MT_Separated + MT_Widowed + ea.less.hs.deg + ea.hs.deg + ea.col.ass.deg + ea.ba.deg + ea.grad.prof.deg + MI_Estimate + tot.pop + `CT_<10` + `CT_10-14` + `CT_15-19` + `CT_20-24` + `CT_25-29` + `CT_30-34` + `CT_35-44` + `CT_45-59` + `CT_>60` + ES_Total_labor_force + ES_Civilian_labor_force + ES_Civilian_labor_force_employed + ES_Civilian_labor_force_unemployed + ES_Armed_Forces + ES_Not_in_labor_force + land + urban + poverty + no.ins + disab + no.comp + `broad&comp` + no.eng + sing.mom + live.alone + pub.assist + no.phone + no.plumb + married.kid + hhd.no.comp + hhd.only.phone + hhd.no.int + hhd.broad + index_apr20, data = train.set)

# Plot tree model
fancyRpartPlot(train.dep.tree, main = "Depression Rate Tree")

fancyRpartPlot(train.mhlth.tree, main = "Poor Mental Health Rate Tree")

# Print summary
summary(train.dep.tree)
summary(train.mhlth.tree)

Regression Tree Performance

Our depression rate regression tree has an R-squared value of 0.348 and a mean squared error (MSE) of 6.12 (the average squared difference between the actual and predicted values of the dependent variable). This indicates that approximately 35 percent of the variation in the depression rate is associated with variation in the independent variables. Our poor mental health rate tree has an R-squared value of 0.584 and an MSE of 1.35. Compared with one another, the poor mental health regression tree performs better.

However, by these metrics, both regression trees perform worse than their corresponding multiple linear regressions. The depression multiple linear regression has an R-squared value of 0.376 and an MSE of 5.78, and the poor mental health multiple linear regression has an R-squared value of 0.676 (Section 3.2.4). Note that the regression R-squared values are computed on the training set, whereas the tree values are computed on the test set, so the comparison is indicative rather than exact.

# Generate predicted values for test set
test.set$pred.depression <- predict(train.dep.tree, newdata = test.set)

# Calculate evaluation metrics
MSE <- mean((test.set$DEPRESSION - test.set$pred.depression)^2)
R2 <- 1 - sum((test.set$DEPRESSION - test.set$pred.depression)^2) / sum((test.set$DEPRESSION - mean(test.set$DEPRESSION))^2)

# Print evaluation metrics
cat("R-squared:", R2, "\n")
cat("MSE:", MSE, "\n")

# The R-squared value of 0.399 suggests that the regression tree model explains 39.9% of the variation in the data, which is a moderate level of explanation.
# The MSE (Mean Squared Error) of 6.05 represents the average squared difference between the predicted values and the actual values, with a higher value indicating worse performance.

test.set$pred.mhlth <- predict(train.mhlth.tree, newdata = test.set)
# Calculate evaluation metrics
MSE <- mean((test.set$MHLTH - test.set$pred.mhlth)^2)
R2 <- 1 - sum((test.set$MHLTH - test.set$pred.mhlth)^2) / sum((test.set$MHLTH - mean(test.set$MHLTH))^2)

# Print evaluation metrics
cat("R-squared:", R2, "\n")
cat("MSE:", MSE, "\n")

# The regression tree model with MHLTH as the target variable has an R-squared value of 0.685, indicating that 68.5% of the variability in the MHLTH variable can be explained by the model.
# The mean squared error (MSE) is 1.07, which is the average squared difference between the predicted and actual values.
# The root mean squared error (RMSE) is 1.04, which is in the same units as MHLTH and gives the typical size of a prediction error.
# Overall, these metrics suggest that the model has a moderate level of predictive power for the MHLTH variable.
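The trees above are grown with rpart's default settings and are not pruned. One common follow-up is cost-complexity pruning guided by the cross-validated error in the tree's cptable. The sketch below illustrates the idea on the built-in mtcars data, which stands in for train.set; big_tree, best_cp, and pruned_tree are illustrative names, and the control settings are assumptions chosen only to grow a large tree.

```r
library(rpart)

set.seed(123)  # the cross-validation inside rpart() is random
# Grow a deliberately large tree on a built-in dataset (mtcars stands in for train.set)
big_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                  control = rpart.control(cp = 0.001, minsplit = 5))

# Pick the complexity parameter with the lowest cross-validated error
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the tree back to that complexity
pruned_tree <- prune(big_tree, cp = best_cp)
```

The pruned tree can then be evaluated on the test set in the same way as the unpruned trees to check whether pruning changes the R-squared or MSE.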

4. Result

Based on the tests and modeling performed previously, the factors that are highly associated with depression and poor mental health rates are:

Depression: Obesity, poor physical health, high school degree as the highest level of educational attainment, commute time >= 60 min, households without high-speed wireless internet, binge drinking, the economic effect of COVID-19, and no proficient English speakers in the household.

Poor Mental Health: Lack of physical activity, obesity, poor physical health, lack of sleep, BA as the highest level of educational attainment, graduate or professional degree as the highest level of educational attainment, never married, married with a kid, recent checkup, median income, poverty rate, and binge drinking.

We can infer depression and poor mental health rates in a given tract from the above factors with some accuracy. For depression, the variation of these factors is moderately associated (R-squared: 0.37) with the variation in the depression rate. For poor mental health, the variation of these factors is highly associated (R-squared: 0.69) with the variation in the poor mental health rate.

5. Discussion

In this section, we discuss the limitations of our research project and propose potential areas for further research.

5.1. Limitations

Based on the analysis above, this project has several limitations:
- The measure of poor mental health over the last two weeks may be imprecise because the data were combined from different datasets gathered more than two weeks apart. This could introduce measurement bias and affect the accuracy of the results.
- The project did not prune or tune the regression trees, which could have resulted in less accurate models.
- The feature selection process may have been suboptimal, which could leave irrelevant or redundant variables in the model and reduce its predictive power.
- The data used in the analysis may represent only part of the population, as it was collected from a specific geographic region or sample.
- The project was limited to some geographical regions, which may not be representative of the broader population and may have limited the ability to identify additional factors associated with depression and poor mental health.
- The project did not account for potential confounding variables, such as cultural or social factors, that could influence the relationship between the identified factors and depression or poor mental health. This could lead to overestimating or underestimating the actual effect of these factors and should be considered in future research.

5.2. Further Research

Some potential further research directions include:
- Conduct micro-level analysis: Further research could explore the social systems and relationships contributing to depression. This could involve examining individual-level factors, such as social support, family dynamics, and work stress, and how they impact mental health.
- Integrate genetic information and other biological measures, such as hormonal imbalances and brain imaging. There is evidence that genetics play a role in the development of depression. Further research could examine genetic information and other biological factors related to depression.
- Apply additional feature selection methods to identify the most important predictors of depression and poor mental health. As mentioned earlier, the feature selection process used in this project was limited. Further research could involve more sophisticated methods, such as principal component analysis or regularized regression techniques.
- Explore additional factors: The current analysis focused on a limited set of factors in the dataset. Future research could explore additional factors related to depression, such as regular diet and lifestyle choices, and investigate their relationship to depression and poor mental health rates.
- Time-series analysis: The current analysis did not consider the temporal aspect of depression and mental health rates. Future research could incorporate time-series analysis to examine how depression and mental health rates change over time and to identify seasonal or cyclical trends that may shed light on environmental or societal changes.
- Conduct qualitative research, such as interviews and focus groups, to better understand the subjective experiences of individuals living with depression and gain insights into potential risk factors that quantitative measures may not capture.
- Explore the potential role of social support networks, such as family and friends, in mitigating or exacerbating depression and poor mental health, and develop interventions to strengthen social support.

Conclusion

In conclusion, this research project aimed to identify the factors associated with depression and poor mental health rates in the United States. The study used multiple linear regression models, regression trees, and ANOVA testing to analyze demographic, socioeconomic, and health-related variables. The results indicated that factors such as obesity, poor physical health, high school degree as the highest educational attainment, lack of physical activity, lack of sleep, and binge drinking were significantly associated with depression and poor mental health rates. The project also identified several limitations, including potential measurement bias, incomplete feature selection, and geographical restrictions, which may affect the accuracy of the results. The study contributes to the growing body of literature on the social determinants of mental health and provides a basis for further research and policy interventions. The findings can assist policymakers in identifying high-risk populations and designing targeted interventions. Future research directions include exploring the role of social support networks in mitigating or exacerbating depression, conducting qualitative research to better understand the subjective experiences of individuals living with depression, and integrating genetic and biological measures into the analysis. By identifying the factors most closely associated with depression and poor mental health rates, this study offers insight into mental health disparities and a starting point for addressing this pressing public health issue.